1. The Data Set

1.1. Procuring the data

In order to study the ways in which our country’s weather has behaved in the past years, we decided to look for a data set on different sources from the Internet. Unfortunately, we have not found any data set containing the metrics we need, therefore we decided to use the Visual Crossing Weather API, which would give us the weather conditions in Romania in a set of locations and time periods we chose to specify.

The period of time has been set to a daily basis from Jan 1st 2011, until Dec 31st 2021. This choice has been made due to financial limitations we encountered on the API’s part. In case that we had wanted to procure more data, more requests would have been necessary to be made to the public API, exceeding the threshold of free requests.

For each request, the API will return a single .csv file, respecting the name convention weather_YEAR_COUNTY.csv. So, for 41 Romanian counties and 11 years, we would obtain 451 csv files. However, we need to work with a single data set, so we decided to merge all of them into a single one in the code block below.

library(tidyverse)
library(ggplot2)
library(plotly)

# TRUE only at first when "weather_2011-2021_Romania.csv" is not present in our project
MERGE_ALL_DATASETS = FALSE

if (MERGE_ALL_DATASETS) {
  
  filenames_list = list.files(
    path = "../weather_data/",
    pattern = "*.csv",
    full.names = TRUE
  )
  
  data = lapply(filenames_list, read_csv) %>% bind_rows()
  
  write.csv(data, "weather_2011-2021_Romania.csv", row.names = FALSE, na = "")
} 

data = read_csv("weather_2011-2021_Romania.csv", show_col_types = FALSE)
rm(MERGE_ALL_DATASETS)

1.2. The data set’s columns

The data set that we have created contains 25 columns, out of which the most relevant (used in our EDA) are described in the table below. Note that to these columns, many others will be added during the data cleaning process.

For complete details related to the data given by the API, please refer to this page.

Column Name Description Observations
Address The address used by the API for the location from where the weather should be registered Usually of form COUNTY_SEAT_NAME, Romania.
Date time The date for when the weather is registered The weather associated to a day represents the average of 24 hourly collected data.
Minimum Temperature Minimum temperature over the day Measured in °C.
Maximum Temperature Maximum temperature over the day Measured in °C.
Temperature Average temperature over the day Measured in °C.
Wind Speed 2 minute average of wind speed Measured in Km/h.
Wind Direction 2 minute average of wind direction Measured as a number in the range 0-360, representing the degrees count on the trigonometric circle.
Precipitation Amount of liquid equivalent precipitation (rain, snow etc.) Measured in millimeters.
Precipitation Cover Percentage of time where measurable precipitation occurred Ranges from 0 to 100.
Snow Depth Depth of snow on ground Measured in centimeters.
Visibility Distance that can be viewed Measured in km.
Cloud Cover Percentage of sky that is covered by cloud Ranges from 0 to 100.
Weather Type Weather types reported by weather station String of multiple weather types separated by commas.
Latitude The latitude of the address It is a floating number.
Longitude The longitude of the address It is a floating number.
Conditions Weather sky conditions Same format as Weather Types, but with much lesser options.

2. Cleaning the Data Set

The data set we obtained is mostly in a good shape. However, on a quick scan, we identified a few aspects that required some attention before diving into the exploratory analysis (e.g. splitting complex columns into multiple simpler ones).

In order to add value to the data set, we also decided to do some feature engineering, and create additional columns that would enhance the diversity of the data types. This diversity will allow for creating new types of plots and open multiple questions with possibly more surprising insights in the Romanian weather.

We start by removing blank spaces from column names for easier writing of the R code (e.g. “Minimum Temperature” becomes “MinimumTemperature”) and redundant columns ResolvedAddress and Name.

names(data) = gsub(' ', '', names(data))
data = data %>% select(-c("ResolvedAddress", "Name"))

2.1. Splitting the Weather Type & Conditions columns

The data set contains two columns named Weather Type and Conditions that contain strings of different weather conditions identified at a specific time and location, each condition being separated by a comma. However, there can be multiple conditions reported in a day, so we decided that, instead of having a single column describing all of the conditions, we could create a separate column for each weather type and condition.

As a result, all of these new columns will be of boolean type, describing whether the weather has been or not identified in the given condition. There are 37 possible weather types and 4 possible conditions in the data set, therefore 41 new columns will be created.

accumulate_weather_func = function(accumulated_types, weather) {
  if (!is.na(weather)) {
    conditions = strsplit(weather, ", ")[[1]]
    accumulated_types = union(accumulated_types, conditions)
  }
  
  return(accumulated_types)
}

# Compute the list of possible weather types in this data set
weather_types = reduce(data$WeatherType, accumulate_weather_func, .init = c())

cat("Number of possible weather types:", length(weather_types))
## Number of possible weather types: 37
# Create the weather type columns
for (weather_type in weather_types) {
  column_name = gsub(' ', '', weather_type)  # remove spaces
  column_name = gsub('/', 'Or', column_name)  # replace slashes with Or
  
  data[[column_name]] = with(
    data, 
    # If Weather is NA, then new column will contain NA as well
    if_else(
      is.na(WeatherType),
      NA,
      # If Weather is defined, check that condition is present => set TRUE/FALSE
      if_else(
        grepl(weather_type, WeatherType, fixed = TRUE),
        TRUE,
        FALSE
      )
    )
  )
}

rm(column_name, weather_type)

# Display an example of some weather columns
data %>%
  slice(42:46) %>%
  select(c(WeatherType, Fog, LightRain, SnowShowers, Duststorm))
## # A tibble: 5 x 5
##   WeatherType                              Fog   LightRain SnowShowers Duststorm
##   <chr>                                    <lgl> <lgl>     <lgl>       <lgl>    
## 1 Mist, Fog, Light Rain                    TRUE  TRUE      FALSE       FALSE    
## 2 Snow Showers, Mist, Rain, Rain Showers,~ FALSE TRUE      TRUE        FALSE    
## 3 Mist                                     FALSE FALSE     FALSE       FALSE    
## 4 <NA>                                     NA    NA        NA          NA       
## 5 Mist                                     FALSE FALSE     FALSE       FALSE
# Compute the list of possible conditions in this data set
conditions = reduce(data$Conditions, accumulate_weather_func, .init = c())

cat("Number of possible conditions:", length(conditions))
## Number of possible conditions: 4
# Create the conditions columns
for (condition in conditions) {
  column_name = paste0("Condition", condition)
  
  if (condition == "Partially cloudy") {
    column_name = "ConditionPartiallyCloudy"
  }
  
  data[[column_name]] = with(
    data, 
    # If Condition is NA, then new column will contain NA as well
    if_else(
      is.na(Conditions),
      NA,
      # If Condition is defined, check that condition is present => set TRUE/FALSE
      if_else(
        grepl(condition, Conditions, fixed = TRUE),
        TRUE,
        FALSE
      )
    )
  )
}

rm(column_name, condition)

# Display an example of the conditions columns
data %>%
  slice(11:15) %>%
  select(c(Conditions, ConditionClear, ConditionPartiallyCloudy, ConditionOvercast, ConditionRain))
## # A tibble: 5 x 5
##   Conditions      ConditionClear ConditionPartia~ ConditionOverca~ ConditionRain
##   <chr>           <lgl>          <lgl>            <lgl>            <lgl>        
## 1 Clear           TRUE           FALSE            FALSE            FALSE        
## 2 Partially clou~ FALSE          TRUE             FALSE            FALSE        
## 3 Overcast        FALSE          FALSE            TRUE             FALSE        
## 4 Overcast        FALSE          FALSE            TRUE             FALSE        
## 5 Rain, Overcast  FALSE          FALSE            TRUE             TRUE

2.2. Categorizing the Wind Direction

As the data is currently offered, the wind direction is expressed as a real number, whose value ranges between 0 and 360 degrees. We can express this value under a categorical form, by assigning a named direction for each specific division on the trigonometric circle. In this way, we can obtain 16 different categories for 16 subdivisions, as can be checked in the figure below.

wind_direction_table = data.frame(
  "Cardinal_Point" = c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"),
  "From" = c(348.75, seq(11.25, 326.25, 22.5)),
  "To" = seq(11.25, 348.75, 22.5)
)

Using the above table, we can take each value from the WindDirection column and build the new column WindCardinalDirection.

# Break down the numeric direction into more categories
data = data %>%
  mutate(
    WindCardinalDirection = cut(
      x = WindDirection,
      breaks = c(0, wind_direction_table$To, 360),
      labels = c(wind_direction_table$Cardinal_Point, "N")
    )
  )

# Display an example of some wind directions
data %>%
  slice_sample(n=5) %>%
  select(c(WindDirection, WindCardinalDirection))
## # A tibble: 5 x 2
##   WindDirection WindCardinalDirection
##           <dbl> <fct>                
## 1          243. WSW                  
## 2          141. SE                   
## 3          193. SSW                  
## 4          267. W                    
## 5          208. SSW

2.3. Categorizing the Visibility

The data set presents a column called Visiblity that represents the distance, measured in kilometers, at which an object or light can be clearly discerned.

As mentioned by the National Weather Service at this location, we can break down these distances into 4 sets, creating a new column called VisibilityLevel, which describes an ordered categorical variable.

The 4 divisions are described as being the following:

  • Very Poor - Less than 0.5 nautical miles (< 0.926 Km)
  • Poor - 0.5 to less than 2 nautical miles (0.926 Km - 3.704 Km)
  • Moderate - 2 to 5 nautical miles (3.704 Km - 9.26 Km)
  • Good - Greater than 5 nautical miles (> 9.26 Km)
data = data %>%
  mutate(
    VisibilityLevel = cut(
      x = Visibility,
      breaks = c(0, 0.926, 3.704, 9.26, +Inf),
      labels = c("Very Poor", "Poor", "Moderate", "Good"),
      include.lowest = TRUE
    )
  )
  
# Display an example of some visibility levels
data %>%
  slice_sample(n=5) %>%
  select(c(Visibility, VisibilityLevel))
## # A tibble: 5 x 2
##   Visibility VisibilityLevel
##        <dbl> <fct>          
## 1        9   Moderate       
## 2       11.2 Good           
## 3       12.3 Good           
## 4       12.8 Good           
## 5        8.2 Moderate

2.4. Clean the Counties

2.4.1. Rewriting the County names

The column Address contains the name of each county seat in our country, followed by , Romania, but we would like to have a column called County that contains only the name of the county (not the county seat) to use while plotting the Romania’s map. It is important that the county names computed here to be the same as the ones employed in the geoJson file containing data about drawing the counties’ borders (e.g. must rename Bucharest to Bucuresti).

In addition to this, we also create a new column Mnemonic which can be used in plotting interactive maps.

romania_counties = c("Alba","Arad","Arges","Bacau","Bihor","Bistrita-Nasaud","Botosani","Brasov","Braila","Bucuresti","Buzau","Caras-Severin","Calarasi","Cluj","Constanta","Covasna","Dambovita","Dolj","Galati","Giurgiu","Gorj","Harghita","Hunedoara","Ialomita","Iasi","Maramures","Mehedinti","Mures","Neamt","Olt","Prahova","Satu Mare","Salaj","Sibiu","Suceava","Teleorman","Timis","Tulcea","Vaslui","Valcea","Vrancea")
romania_county_seats = c("Alba Iulia","Arad","Pitesti","Bacau","Oradea","Bistrita","Botosani","Brasov","Braila","Bucuresti","Buzau","Resita","Calarasi","Cluj-Napoca","Constanta","Sfantu Gheorghe","Targoviste","Craiova","Galati","Giurgiu","Targu Jiu","Miercurea Ciuc","Deva","Slobozia","Iasi","Baia Mare","Drobeta-Turnu Severin","Targu Mures","Piatra Neamt","Slatina","Ploiesti","Satu Mare","Zalau","Sibiu","Suceava","Alexandria","Timisoara","Tulcea","Vaslui","Ramnicu Valcea","Focsani")
romania_mnemonics = c("AB","AR","AG","BC","BH","BN","BT","BV","BR","B","BZ","CS","CL","CJ","CT","CV","DB","DJ","GL","GR","GJ","HR","HD","IL","IS","MM","MH","MS","NT","OT","PH","SM","SJ","SB","SV","TR","TM","TL","VS","VL","VN")

# First create a column CountySeat, translate the special Bucharest case,
# and then, based on the above lists, create the final column County
data = data %>%
  mutate(CountySeat = gsub(', Romania', '', Address)) %>%
  mutate(CountySeat = ifelse(CountySeat == "Bihor", "Oradea", CountySeat)) %>%
  mutate(
    CountySeat = ifelse(
      CountySeat == "Bucharest",
      "Bucuresti",
      CountySeat
    )
  ) %>%
  mutate(
    County = plyr::mapvalues(
      x = CountySeat,
      from = romania_county_seats,
      to = romania_counties
    ),
    Mnemonic = plyr::mapvalues(
      x = CountySeat,
      from = romania_county_seats,
      to = romania_mnemonics
    )
  )

# Display an example of several counties
data %>%
  slice_sample(n=5) %>%
  select(c(Address, CountySeat, County, Mnemonic))
## # A tibble: 5 x 4
##   Address             CountySeat County          Mnemonic
##   <chr>               <chr>      <chr>           <chr>   
## 1 Bistrita, Romania   Bistrita   Bistrita-Nasaud BN      
## 2 Alba Iulia, Romania Alba Iulia Alba            AB      
## 3 Zalau, Romania      Zalau      Salaj           SJ      
## 4 Iasi, Romania       Iasi       Iasi            IS      
## 5 Ploiesti, Romania   Ploiesti   Prahova         PH

2.4.2. Complete Ilfov’s weather

We were not sure what the API will return when calling it to get Ilfov’s weather (might have returned Bucharest’s weather, which is inside it), so we decided to pass this county’s API calls in order to allow for other free calls for other counties.

In this project we will suppose that Ilfov and Bucharest share the same weather.

ilfov_data = data %>%
  filter(County == "Bucuresti") %>%
  mutate(
    Address = "Ilfov, Romania",
    County = "Ilfov",
    Mnemonic = "IF",
    CountySeat = "Bucuresti",
    Latitude = 44.5355,
    Longitude = 26.2324
  )

data = rbind(data, ilfov_data)

rm(ilfov_data)

romania_counties = append(romania_counties, "Ilfov")
romania_county_seats = append(romania_county_seats, "Bucuresti") 
romania_mnemonics = append(romania_mnemonics, "IF")

2.5. Categorizing Time

For some of our plots we detected a need to compare different seasons of the year, so we decided to create a new column specifically for this scope.

compute_season_from_date = function(date_col) {
  month_col = substr(date_col, 1, 2)

  seasons_col = map(month_col, function(month) {
    if (month %in% c("03", "04", "05")) {
      return("Spring")
    }
    
    if (month %in% c("06", "07", "08")) {
      return("Summer")
    }
    
    if (month %in% c("09", "10", "11")) {
      return("Autumn")
    }
    
    return("Winter")
  })
  
  return(as.character(seasons_col))
}

data = data %>%
  mutate(Season = compute_season_from_date(Datetime))

…as well as for the month…

compute_month_from_date = function(date_col, month_pos = 1) {
  month_col = substr(date_col, month_pos, month_pos + 1)
  mapping = list(
    "01" = "January",
    "02" = "February",
    "03" = "March",
    "04" = "April",
    "05" = "May",
    "06" = "June",
    "07" = "July",
    "08" = "August",
    "09" = "September",
    "10" = "October",
    "11" = "November",
    "12" = "December"
  )
  
  return(as.character(mapping[month_col]))
}

months_names = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

data = data %>%
  mutate(Month = compute_month_from_date(Datetime)) %>%
  mutate(Month = factor(Month, levels = months_names))

…and year…

compute_year_from_date = function(date_col, year_pos = 7) {
  year_col = substr(date_col, year_pos, year_pos + 3)
  return(as.character(year_col))
}

data = data %>%
  mutate(Year = compute_year_from_date(Datetime)) %>%
  mutate(Year = factor(Year))

3. Breaking down the Weather

3.1. Exploring the Weather Types

First we define the data set portion that we are going to use in this analysis.

# Select relevant columns for this analysis.
data_wt_cond = data %>% 
  select(
    "Season", "County", "Mnemonic", "Latitude", "Longitude",
     "Mist":"HeavyFreezingRain",
     "ConditionPartiallyCloudy":"ConditionRain"
    )
wt_cols = 6:42
cond_cols = 43:46

3.1.1. Unusual Weather Types

There is a wide variety of weather types that can be identified in Romania during the years’ four seasons. We have identified 37 special ones, as labeled by the Visual Crossing Weather API, but what is the most common unusual weather in Romania? We could first take a look at the exact number of occurrences for each weather type.

# Create data frame for the first plot
data_wt_cond_1 = data_wt_cond %>%
  summarise(across(all_of(wt_cols), sum, na.rm = TRUE)) %>%
  gather(key = "WeatherType", value = "Occurrences") %>%
  arrange(Occurrences) %>%
  mutate(WeatherType = factor(WeatherType, WeatherType))

total_occurrences = (data_wt_cond_1 %>% summarise(sum(Occurrences)))[[1]]

data_wt_cond_1 = data_wt_cond_1 %>%
  mutate(OccurrencesFreq = round(Occurrences / total_occurrences * 100, 3))

# Define text label function
get_occ_text = function(type_col, occ_col, occ_freq_col) {
  texts = list()
  
  for (i in 1:length(occ_col)) {
    text = paste0(
      "Weather: ", type_col[i], "\n",
      occ_freq_col[i], "% of Unusual Weather\n",
      occ_col[i], " observed occurrences"
    )
    texts = append(texts, text)
  }
  return(texts)
}

plot_wt_cond_1 = data_wt_cond_1 %>%
  ggplot(aes(
    x = WeatherType,
    y = OccurrencesFreq,
    text = get_occ_text(WeatherType, Occurrences, OccurrencesFreq)
  )) +
  geom_segment(
    aes(x = WeatherType, xend = WeatherType, y = 0, yend = OccurrencesFreq),
    color = "#82a3e0",
    alpha = 1
  ) +
  geom_point(color="#82a3e0", size=2, alpha=1) +
  coord_flip() +
  labs(
    title = "Frequency of Unusual Weather Conditions in Romania",
    x = "",
    y = "% of unusual weather occurrences"
  )

rm(data_wt_cond_1, total_occurrences)

ggplotly(plot_wt_cond_1, tooltip = "text")
rm(plot_wt_cond_1)

The lollipop plot from above tells us that the Mist weather has the biggest amount of occurrences during 2011-2021 when summing up all the Romania’s counties. This might be explained by the counties found in the Carpathian Mountains where this type of weather is usually more manifested, together with Rain, Fog, and Snow.

In order to confirm this hypothesis, we decided to plot the most unusual weather for each county and check the results in the following map.

library(geojsonio)
library(broom)

romania_spdf = geojson_read("romania.geojson", what = "sp")
romania_map = tidy(romania_spdf, region = "name")

data_wt_cond_2 = data_wt_cond %>%
  select(c(County, all_of(wt_cols))) %>%
  group_by(County) %>%
  summarise(across(1:37, sum, na.rm = TRUE)) %>%
  gather(key = "WeatherType", value = "Occurrences", -County) %>%
  group_by(County) %>%
  filter(Occurrences == max(Occurrences)) %>%
  mutate(WeatherType = factor(WeatherType, WeatherType)) %>%
  select(-Occurrences) %>%
  ungroup()

map_wt_cond_2 = romania_map %>%
  left_join(. , data_wt_cond_2, by = c("id" = "County"))

ggplot() +
  geom_polygon(
    data = map_wt_cond_2,
    aes(fill = WeatherType, x = long, y = lat, group = group),
    color="#dbdbdb"
  ) +
  coord_map() +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(
    values = c("#b1b3c7", "#848bc4", "#888a99", "#343873"),
    labels = c("Mist", "Sky Coverage Increasing", "Fog", "Rain")
  ) +
  labs(
    title = "Most Unusual Weather Conditions in Romania",
    fill = "Weather Type"
  )

rm(romania_map, data_wt_cond_2, map_wt_cond_2)

As it is the case here, it seems that the Mist weather is encountered mostly in the southern and eastern parts of the country, while the Oriental Carpathians present a Fog weather, with mostly Rainy days in Mures and Sibiu, inside the Transylvanian Plateau. Note that Mist is also a type of Fog, but one that contains less condensed water drops.

3.1.2. Other Weather Conditions

Here we look into the 4 conditions the data set has to offer related to the Conditions column. In the pie chart below you can see the overall distribution of the weather conditions in the entire period among all the Romania’s counties, and it looks like the most part of the weather is Partially Cloudy.

data_wt_cond_3 = data_wt_cond %>%
  summarise(across(all_of(cond_cols), sum, na.rm = TRUE)) %>%
  gather(key = "Condition", value = "Occurrences") %>%
  arrange(desc(Occurrences)) %>%
  mutate(Condition = factor(Condition, Condition)) %>%
  mutate(ypos = cumsum(Occurrences) - 0.5 * Occurrences )

total_occurrences = (data_wt_cond_3 %>% summarise(sum(Occurrences)))[[1]]

data_wt_cond_3 = data_wt_cond_3 %>%
  mutate(OccurrencesFreq = round(Occurrences / total_occurrences * 100, 2))

data_wt_cond_3 %>%
  ggplot(aes(x = "", y = Occurrences, fill = Condition)) +
  geom_bar(stat = "identity", width = 1, color="white") +
  coord_polar("y", start = 0) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Distribution of registered Sky Conditions") +
  scale_fill_manual(
    values = c("#0c9fc7", "#041b8f", "#02d7e3", "#363e66"),
    labels = c("Partially Cloudy", "Rain", "Clear", "Overcast")
  ) +
  geom_text(
    aes(y = total_occurrences - ypos, label = paste0(OccurrencesFreq, "%")),
    color = "white",
    size = 6
  )

rm(data_wt_cond_3, total_occurrences)

The next part would be to plot these conditions on the Romania’s map, however, we decided to take a step further, and plot their distributions among the four different seasons. Down below you can observe the most observed weather condition for each county during spring, summer, autumn, and winter time.

library(leaflet)

data_wt_cond_4 = data_wt_cond %>%
  select(c(Season, County, Mnemonic, all_of(cond_cols))) %>%
  group_by(Season, County, Mnemonic) %>%
  summarise(across(1:4, sum, na.rm = TRUE)) %>%
  gather(key = "SkyCondition", value = "Occurrences", -c(County, Season, Mnemonic)) %>%
  group_by(Season, County, Mnemonic) %>%
  filter(Occurrences == max(Occurrences)) %>%
  mutate(SkyCondition = factor(SkyCondition, SkyCondition)) %>%
  ungroup()

# Surprisingly, during Winter, Timis has both Rain and Partialy Cloudy as the most
# encountered conditions (both with 140 occurences), so we must pick one to display.
# We chose Partially Cloudy, since it looks like this is the trend in Romania
data_wt_cond_4 = data_wt_cond_4 %>%
  filter(County != "Timis" | Season != "Winter" | SkyCondition != "ConditionRain")

convert_cond_col_to_text = function(cond_col) {
  from_arr = c("ConditionClear", "ConditionPartiallyCloudy", "ConditionOvercast", "ConditionRain")
  to_arr = c("Clear", "Partially Cloudy", "Overcast", "Rain")

  return(plyr::mapvalues(
    x = cond_col,
    from = from_arr,
    to = to_arr
  ))
}

plot_conditions_distribution_for_season = function(season_name, season_palette) {
  data_wt_cond_season_4 = data_wt_cond_4 %>%
    filter(Season == season_name) %>%
    select(County, SkyCondition)

  map_wt_season_4 = romania_spdf
  
  map_wt_season_4@data = map_wt_season_4@data %>%
    left_join(data_wt_cond_season_4, by = c("name" = "County"))
  
  cond_palette = colorFactor(
    palette = season_palette,
    levels = c("ConditionClear", "ConditionPartiallyCloudy", "ConditionOvercast", "ConditionRain"),
    domain = map_wt_season_4@data$SkyCondition,
    na.color = "transparent"
  )
  
  season_labels = sprintf(
    "<strong>%s</strong><br/>Condition: <em>%s</em>",
    map_wt_season_4$name,
    convert_cond_col_to_text(map_wt_season_4$SkyCondition)  
  ) %>% lapply(htmltools::HTML)
  
  
  plot = leaflet(
      map_wt_season_4,
      options = leafletOptions(zoomSnap = 0.5, zoomDelta = 0.5)
    ) %>%
    setView(25, 46, 6.5) %>%
    addProviderTiles(providers$OpenStreetMap.HOT) %>%
    addPolygons(
      fillColor = ~cond_palette(SkyCondition),
      opacity = 1,
      color = "#dbdbdb",
      weight = 2,
      fillOpacity = 0.7,
      highlightOptions = highlightOptions(
        weight = 4,
        color = "#6b6b6b",
        fillOpacity = 0.7,
        bringToFront = TRUE
      ),
      label = season_labels,
      labelOptions = labelOptions(
        style = list("font-weight" = "normal", padding = "3px 8px"),
        textsize = "15px",
        direction = "auto"
      )
    ) %>%
    addLegend(
      pal = cond_palette,
      values = ~SkyCondition,
      opacity = 0.7,
      title = paste(season_name, "Conditions"),
      position = "topright",
      labFormat = labelFormat(transform = convert_cond_col_to_text)
    )
  
  return(plot)
}

spring_palette = c("#79f520", "#6aba2d", "#357a18", "#2c6314")
summer_palette = c("#f5f120", "#bab72d", "#7a7818", "#636214")
autumn_palette = c("#f56b20", "#ba5e2d", "#7a3918", "#632c14")
winter_palette = c("#20d9f5", "#2daaba", "#18737a", "#0e3b40")

plot_conditions_distribution_for_season("Spring", spring_palette)
plot_conditions_distribution_for_season("Summer", summer_palette)
plot_conditions_distribution_for_season("Autumn", autumn_palette)
plot_conditions_distribution_for_season("Winter", winter_palette)
rm(spring_palette, summer_palette, autumn_palette, winter_palette)
rm(data_wt_cond_4, data_wt_cond, wt_cols, cond_cols)

We can conclude this section with a few observations:

  • during spring, only the Rain and Partially Cloudy weather are dominant in Romania;
  • during summer and autumn, there is no county having a dominant Overcast weather;
  • winter is the only season where the Overcast weather is dominant, even withing 10 counties;
  • every season, Romania is dominated by a Partially Cloudy weather;
  • Cluj and Sibiu have presented a dominant Rain weather during all of the seasons.

3.2. Exploring the Temperatures

In this section we are going to look at the first thing that comes to one’s mind when talking about weather: Temperature.

We will first look at the distribution of the minimum, maximum and average daily temperature across all Romania’s counties during the entire period 2011-2021. This result is achieved by averaging the temperatures for each day across all the 42 counties, and then displaying them into the violin plot from below.

data_temp_1 = data %>%
  select(Datetime, MinimumTemperature, MaximumTemperature, Temperature) %>%
  group_by(Datetime) %>%
  summarise(
    Minimum = mean(MinimumTemperature, na.rm = TRUE),
    Maximum = mean(MaximumTemperature, na.rm = TRUE),
    Average = mean(Temperature, na.rm = TRUE)
  ) %>%
  select(-Datetime) %>%
  gather(key = "Temperature", value = "Value") %>%
  mutate(Temperature = factor(Temperature, levels=c("Minimum", "Average", "Maximum")))

plot_temp_1 = ggplot(data_temp_1, aes(x=Temperature, y=Value, fill=Temperature)) +
  geom_violin(alpha=0.8) +
  labs(
    title = "Temperature distribution in Romania during 2011-2021",
    x = "Daily Temperatures",
    y = ""
  ) +
  scale_fill_manual(
    name = "",  # remove legend title
    values = c("#1922d4", "#d4cb19", "#d41919"),
    labels = c("Minimum", "Average", "Maximum")
  ) +
  scale_y_continuous(
    breaks = seq(-20, 40, 5),
    limits = c(-20, 40),
    labels = function (x) paste(x, "°C")
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot_temp_1, tooltip = "text")
rm(plot_temp_1, data_temp_1)

A few of the notable aspects to observe in this plot are the following:

  • as expected, minimum temperatures present lower values than the average ones, which in turn are lower than the maximum ones;
  • all of the 3 distributions tend to present 2 modes:
    • daily minimums get more dense around 1 °C and 15 °C;
    • daily averages get more dense around 4 °C and 20 °C;
    • daily maximums get more dense around 12.5 °C and 26 °C;

A reason for the appearance of two modes in each distribution can be given by the extreme temperature differences between summer and winter seasons. If we analyzed the temperatures distribution for each season, we would find without any surprise significant differences, which is what we have done in the following plot.

data_temp_2 = data %>%
  select(Season, Datetime, MinimumTemperature, MaximumTemperature, Temperature) %>%
  group_by(Datetime, Season) %>%
  summarise(
    Minimum = mean(MinimumTemperature, na.rm = TRUE),
    Maximum = mean(MaximumTemperature, na.rm = TRUE),
    Average = mean(Temperature, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  select(-Datetime) %>%
  gather(key = "Temperature", value = "Value", 2:4) %>%
  mutate(Temperature = factor(Temperature, levels=c("Minimum", "Average", "Maximum"))) %>%
  mutate(Season = factor(Season, levels=c("Spring", "Summer", "Autumn", "Winter")))

plot_temp_2 = ggplot(data_temp_2, aes(x=Temperature, y=Value, fill=Temperature)) +
  geom_violin(alpha=0.8) +
  facet_wrap(~Season) +
  labs(
    title = "Temperature distribution in Romania by seasons",
    x = "Daily Temperatures",
    y = ""
  ) +
  scale_fill_manual(
    name = "",  # remove legend title
    values = c("#1922d4", "#d4cb19", "#d41919"),
    labels = c("Minimum", "Average", "Maximum")
  ) +
  scale_y_continuous(
    breaks = seq(-20, 40, 5),
    limits = c(-20, 40),
    labels = function (x) paste(x, "°C")
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot_temp_2, tooltip = "text")

rm(data_temp_2, plot_temp_2)

Looking at other time frames, we decided to look at the month level, and compare the daily average temperatures for each month in Romania over the last 11 years. This way we can conclude which would usually be the hottest/coldest months in Romania.

library(ggridges)

rm(plot_temp_2, data_temp_2)

data_temp_3 = data %>%
  mutate(Month = factor(Month, levels = rev(months_names)))

ggplot(data_temp_3, aes(x = Temperature, y = Month, fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis_c(option = "C") +
  theme(
    legend.position="none",
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(
    breaks = seq(-20, 32.5, 5),
    limits = c(-20, 32.5),
    labels = function (x) paste(x, "°C")
  ) +
  labs(
    title = "Temperature distribution in Romania by months (2011-2021)",
    x = "Daily Average Temperatures",
    y = ""
  )

rm(data_temp_3)

Judging by the ridgeline plot from above, we can tell that August is usually the hottest month of the year, with a mean of around 23 °C, while January can represent the coldest month of the year with a mean of around 0 °C, having slightly more lower temperatures than in December.

Keeping in mind that we can investigate temperatures at a month level, we tried to go back to Minimums, Maximums and Averages, and compute their means for the entire country and depict them in the grouped barplot below.

data_temp_4 = data %>%
  group_by(Month) %>%
  summarise(
    Minimum = mean(MinimumTemperature, na.rm = TRUE),
    Maximum = mean(MaximumTemperature, na.rm = TRUE),
    Average = mean(Temperature, na.rm = TRUE)
  ) %>%
  gather(key = "Temperature", value = "Value", Minimum, Maximum, Average) %>%
  mutate(Temperature = factor(Temperature, levels = c("Maximum", "Average", "Minimum"))) %>%
  mutate(Month = factor(Month, rev(months_names)))

bar_texts = sprintf(
    "<b>%s</b>\nMean of %ss: <em>%s °C</em>",
    data_temp_4$Month,
    data_temp_4$Temperature,
    round(data_temp_4$Value, 2)
  ) %>% 
  lapply(htmltools::HTML)

plot_temp_4 = ggplot(
    data_temp_4,
    aes(
      x = Month,
      y = Value,
      fill = Temperature,
      text = bar_texts
    )
  ) +
  geom_bar(position = "dodge", stat = "identity", alpha = 0.7) +
  coord_flip() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(
    breaks = seq(-5, 30, 5),
    limits = c(-5, 30),
    labels = function (x) paste(x, "°C")
  ) +
  scale_fill_manual(
    name = "",  # remove legend title
    values = c("#d41919", "#d4cb19", "#1922d4"),
    labels = c("Minimum", "Average", "Maximum")
  ) +
  labs(
    title = "Mean Temperatures in Romania by months (2011-2021)",
    y = "Daily Temperatures Means",
    x = ""
  )

ggplotly(plot_temp_4, tooltip = "text")
rm(plot_temp_4, data_temp_4, bar_texts)

As confirmed in this plot as well, August contains the highest Maximum and Average temperatures, while July has the highest Minimum temperatures mean (16.15 °C), which is slightly higher than August (15.89 °C). On the other side, the lowest Minimum, Maximum and Average means are registered during January.

Before going into the maps, one last thing we want to show is the evolution of the temperatures in Romania across the 11 years, on a monthly basis. For this, we have created the line plot down below.

data_temp_5 = data %>%
  select(Year, Month, MinimumTemperature, MaximumTemperature, Temperature) %>%
  group_by(Year, Month) %>%
  summarise(
    Minimum = mean(MinimumTemperature, na.rm = TRUE),
    Maximum = mean(MaximumTemperature, na.rm = TRUE),
    Average = mean(Temperature, na.rm = TRUE)
  ) %>%
  gather(key = "Temperature", value = "Value", Minimum, Maximum, Average) %>%
  ungroup() %>%
  mutate(Temperature = factor(Temperature, levels=c("Minimum", "Average", "Maximum"))) %>%
  mutate(MonthText = paste(Month, Year)) %>%
  mutate(Month = as.Date(paste0(Month, "-01-", Year), format = "%B-%d-%Y")) %>%
  select(-Year)

temp_texts = sprintf(
    "<b>%s</b>\n%ss Mean: <em>%s °C</em>",
    data_temp_5$MonthText,
    data_temp_5$Temperature,
    round(data_temp_5$Value, 2)
  ) %>% 
  lapply(htmltools::HTML)

get_label_from_date_break = function(date_break) {
  month_text = compute_month_from_date(date_break, 6)
  year_text = compute_year_from_date(date_break, 1)
  
  return(paste(month_text, year_text))
}

plot_temp_5 = ggplot(
    data_temp_5,
    aes(
      x = Month,
      y = Value,
      group = Temperature,
      color = Temperature,
      text = temp_texts
    )
  ) +
  geom_line(size = 0.5) +
  stat_smooth(
    geom = "line",
    method ='lm',
    formula = y ~ x,
    size = 0.5,
    se = F,
    alpha = 0.25
  ) +
  labs(
    title = "Monthly evolution of temperatures in Romania",
    x = "",
    y = ""
  ) +
  scale_y_continuous(
    breaks = seq(-15, 35, 5),
    limits = c(-15, 35),
    labels = function (x) paste(x, "°C")
  ) +
  scale_x_date(
    date_breaks = "1 year",
    limits = c(as.Date("2010-08-01"), as.Date("2021-12-01")),
    labels = get_label_from_date_break
  ) +
  scale_color_manual(name = "",values = c("#1922d4", "#d4cb19", "#de0000")) +
  geom_point(size=0.5) +
  theme(
    axis.text.x = element_text(angle = 45),
    plot.title = element_text(hjust = 0.5)
  )
  
ggplotly(plot_temp_5, tooltip = "text")

rm(data_temp_5, plot_temp_5, temp_texts)

In the above line plot, we have also depicted three regression lines, which show us a slow ascending trend over the years for all three temperature types. This can also confirm that Romania is affected by the global warming, a well known phenomenon that affects multiple countries leading to higher and higher temperatures.

The next step in this section consists of looking deeper into the temperatures and depict them depending on the counties. We expect lower temperatures to be registered around the mountains and higher ones in counties situated in plain relief.

data_temp_6 = data %>%
  select(County, MinimumTemperature, Temperature, MaximumTemperature) %>%
  group_by(County) %>%
  summarise(
    Minimum = min(MinimumTemperature, na.rm = TRUE),
    Average = mean(Temperature, na.rm = TRUE),
    Maximum = max(MaximumTemperature, na.rm = TRUE),
  )

date_temp_min_datetimes_6 = data %>%
  select(County, MinimumTemperature, Datetime) %>%
  group_by(County) %>%
  filter(MinimumTemperature == min(MinimumTemperature, na.rm = TRUE)) %>%
  group_by(County, MinimumTemperature) %>%
  summarise(
    MinTempDatetime = first(Datetime)
  ) %>%
  ungroup() %>%
  select(County, MinTempDatetime) %>%
  mutate(MinTempDatetime = format(as.Date(MinTempDatetime, format = "%m/%d/%Y"), "%d %b %Y"))

date_temp_max_datetimes_6 = data %>%
  select(County, MaximumTemperature, Datetime) %>%
  group_by(County) %>%
  filter(MaximumTemperature == max(MaximumTemperature, na.rm = TRUE)) %>%
  group_by(County, MaximumTemperature) %>%
  summarise(
    MaxTempDatetime = first(Datetime)
  ) %>%
  ungroup() %>%
  select(County, MaxTempDatetime) %>%
  mutate(MaxTempDatetime = format(as.Date(MaxTempDatetime, format = "%m/%d/%Y"), "%d %b %Y"))

data_temp_6 = data_temp_6 %>%
  left_join(date_temp_min_datetimes_6) %>%
  left_join(date_temp_max_datetimes_6)

map_temp_6 = romania_spdf

map_temp_6@data = map_temp_6@data %>%
  left_join(data_temp_6, by = c("name" = "County"))

bins = c(4, 6.5, 8.5, 10, 11, 12, 13, 14)
temp_palette = colorBin("YlOrRd", domain = map_temp_6$Average, bins = bins)

temp_labels = sprintf(
  "<strong>%s</strong><br/>
  Average Temperature: <strong><em>%s °C</em></strong><br/>
  Minimum: <strong><em>%s °C</em></strong> <em>(%s)</em><br/>
  Maximum: <strong><em>%s °C</em></strong> <em>(%s)</em>",
  map_temp_6$name,
  round(map_temp_6$Average, 2),
  round(map_temp_6$Minimum, 2),
  map_temp_6$MinTempDatetime,
  round(map_temp_6$Maximum, 2),
  map_temp_6$MaxTempDatetime
) %>% lapply(htmltools::HTML)

leaflet(
    map_temp_6,
    options = leafletOptions(zoomSnap = 0.5, zoomDelta = 0.5)
  ) %>%
  setView(25, 46, 6.5) %>%
  addProviderTiles(providers$OpenStreetMap.HOT) %>%
  addPolygons(
    fillColor = ~temp_palette(Average),
    opacity = 1,
    color = "#dbdbdb",
    weight = 2,
    fillOpacity = 0.7,
    highlightOptions = highlightOptions(
      weight = 4,
      color = "#6b6b6b",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = temp_labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  ) %>%
  addLegend(
    pal = temp_palette,
    values = ~Average,
    opacity = 0.7,
    title = "Temperature (°C)",
    position = "topright"
  )
rm(data_temp_6, date_temp_min_datetimes_6, date_temp_max_datetimes_6, map_temp_6, bins, temp_palette, temp_labels)

Another way to look at the temperature distributions among counties is by month. In order to study from this perspective, we decided to use a heatmap and depict the average temperature in each county for each month among the 11 observed years.

library(heatmaply)

data_temp_7 = data %>%
  select(County, Month, Temperature) %>%
  group_by(County, Month) %>%
  summarise(AverageTemperature = round(mean(Temperature, na.rm = TRUE), 2)) %>%
  ungroup() %>%
  arrange(County) %>%
  mutate(County = factor(County))

matrix_temp_7 = matrix(data_temp_7$AverageTemperature, nrow = 42, ncol = 12, byrow = TRUE)
rownames(matrix_temp_7) = levels(data_temp_7$County)
colnames(matrix_temp_7) = levels(data_temp_7$Month)

# Sort the matrix rows by the temperatures registered in August (hottest month)
matrix_temp_7 = matrix_temp_7[order(matrix_temp_7[,8], decreasing = TRUE),]

heatmaply(matrix_temp_7,
  dendrogram = "none",
  xlab = "",
  ylab = "", 
  main = "",
  margins = c(60,100,40,20),
  grid_color = "white",
  grid_width = 0.00001,
  titleX = FALSE,
  hide_colorbar = TRUE,
  branches_lwd = 0.1,
  label_names = c("County", "Month", "Average Temperature"),
  label_format_fun = function(x) paste(round(x, 2), "°C"),
  fontsize_row = 8,
  fontsize_col = 8,
  labCol = colnames(matrix_temp_7),
  labRow = rownames(matrix_temp_7),
  heatmap_layers = theme(axis.line=element_blank()),
  colors = plasma
)
rm(data_temp_7, matrix_temp_7)

3.3. Exploring the Winds

Another aspect to look into when analyzing the weather of a country is the winds. Here we will look into the direction from where the wind is coming mostly in Romania.

data_wind_1 = data %>%
  select(WindCardinalDirection) %>%
  filter(!is.na(WindCardinalDirection)) %>%
  group_by(WindCardinalDirection) %>%
  count(name = "Frequency") %>%
  ungroup() %>%
  mutate(Frequency = round(Frequency / sum(Frequency) * 100, 2)) %>%
  mutate(Id = 1:n())

label_data = data_wind_1

bars_count = nrow(label_data)
angle = 90 - 360 * (label_data$Id - 0.5) / bars_count
label_data$Hjust = ifelse(angle < -90, 1, 0)
label_data$Angle = ifelse(angle < -90, angle + 180, angle)

ggplot(data_wind_1, aes(x = WindCardinalDirection, y = Frequency)) +
  geom_bar(stat="identity", fill=alpha("blue", 0.6)) +
  ylim(-5, 13) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
  ) +
  coord_polar(start = -0.0625 * pi) +  # Center the bars accordingly to N/E/S/W
  ggtitle("Distribution of Wind Directions in Romania (2011-2021)") +
  geom_text(
    data = label_data,
    aes(x = Id, y = Frequency + 1, label = WindCardinalDirection, hjust = Hjust),
    color = "black",
    fontface = "bold",
    alpha = 0.6,
    size = 5,
    angle = label_data$Angle,
    inherit.aes = FALSE
  )

rm(data_wind_1, label_data, bars_count, angle)

Looking at the circular barplot from above, we can easily tell that the winds entering Romania come mostly from the South-Western part, while they rarely come from North.

Another property of the wind is its speed, so we can also take a look at how fast the winds can get in our country.

data_wind_2 = data %>%
  select(WindSpeed) %>%
  filter(!is.na(WindSpeed))

median = median(data_wind_2$WindSpeed, na.rm = TRUE)
mean = mean(data_wind_2$WindSpeed, na.rm = TRUE)

ggplot(data_wind_2, aes(x = WindSpeed)) +
  geom_histogram(binwidth = 3, fill = "#0059ff", color = "#1400c9", alpha = 0.8) +
  ggtitle("Distribution of Wind Speed") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Occurrences", x = "Wind Speed (km/h)") +
  xlim(0, 50) +
  scale_x_continuous(
    breaks = seq(0, 50, 2),
    limits = c(0, 50)
  ) +
  geom_vline(
    xintercept = median,
    linetype = "dashed",
    color = "red",
    size = 1
  ) +
  annotate(geom = "text", x = median - 2.2, y = 37500, label = "Median", color = "red") +
  geom_vline(
    xintercept = mean,
    linetype = "dashed",
    color = "black",
    size = 1
  ) +
  annotate(geom = "text", x = mean + 2, y = 36000, label = "Mean", color = "black")

rm(data_wind_2)

In the Wind Speed’s histogram, we can tell that the most observed speeds are around 11 km/h - 18 km/h, with a mean of ~17 km/h and a median value of 14.5 km/h.

A possible reason for which the mean is slightly bigger than the median is given by other outliers that reside at the right wing of the histogram, besides the ones that are visible in this plot. Other values that have been cut out of the plot can go in bigger ranges like 150 km/h - 200 km/h.

3.4. Exploring the Precipitations and their effects

Precipitation is one of the three main processes (evaporation, condensation, and precipitation) that constitute the hydrologic cycle, the continual exchange of water between the atmosphere and Earth’s surface. Water evaporates from ocean, land, and freshwater surfaces, is carried aloft as vapour by the air currents, condenses to form clouds, and ultimately is returned to Earth’s surface as precipitation. (Britannica.com)

We want to see the average precipitation level of each county in each month to measure the potential rainy month for each county. We use heatmap for this.

data_temp_8 = data %>%
  select(County, Month, Precipitation, SeaLevelPressure) %>%
  group_by(County, Month) %>%
  summarise(AveragePrecipitation = round(mean(Precipitation, na.rm = TRUE), 2)) %>%
  ungroup() %>%
  arrange(County) %>%
  mutate(County = factor(County))

matrix_temp_8 = matrix(data_temp_8$AveragePrecipitation, nrow = 42, ncol = 12, byrow = TRUE)
rownames(matrix_temp_8) = levels(data_temp_8$County)
colnames(matrix_temp_8) = levels(data_temp_8$Month)

matrix_temp_8 = matrix_temp_8[order(matrix_temp_8[,8], decreasing = TRUE),]

heatmaply(matrix_temp_8,
  dendrogram = "none",
  xlab = "",
  ylab = "", 
  main = "",
  margins = c(60,100,40,20),
  grid_color = "white",
  grid_width = 0.00001,
  titleX = FALSE,
  hide_colorbar = TRUE,
  branches_lwd = 0.1,
  label_names = c("County", "Month", "Average Precipitation"),
  label_format_fun = function(x) paste(round(x, 2), "°C"),
  fontsize_row = 8,
  fontsize_col = 8,
  labCol = colnames(matrix_temp_8),
  labRow = rownames(matrix_temp_8),
  heatmap_layers = theme(axis.line=element_blank()),
  colors = plasma
)
rm(data_temp_8, matrix_temp_8)

According to the heatmap, June has the highest level of precipitation throughout the year in average in Romania. We can comment that June is the most potential rainy month in Romania.

Moreover, we wanted to see the correlation between the relative columns with precipitation.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(
  data[, c(13,14,15,16,17,18)],
  upper = list(continuous = "density", combo = "box_no_facet"),
  lower = list(continuous = "points", combo = "dot_no_facet")
)

4. Making sense of the Weather

In this part we will search about how each factor is relevant to other factors. For example how high temperature is relevant to DewPoint, etc.

4.1. Surface connections

library(PerformanceAnalytics)

chart.Correlation(data[c(5,6,7,8,9,13,18)], histogram = TRUE, method = "pearson")

The highest correlation is between temperature and DewPoint as 93% while the lowest is between Precipitation and Heatindex by 0.59%.